
[main] Add new method to track originating cluster of Backup #613

Open · wants to merge 11 commits into base: main from bro-metadata-context

Conversation

@mallardduck (Member) commented Oct 31, 2024

This partially solves #612.

However, it's important to note that while I wrote an RFE for this too, the root goal here is to fix a bug. The bug is that "BRO itself doesn't do anything to give indications when In-Place Restore is acceptable to use".

The short version of the fix: we add a few new status details and expose them via kubectl. Later, we'll make a change to rancher/dashboard so the UI can also benefit from this new context by adding a column.


To be discussed:

  • This change is very backwards compatible; can we ship a CRD change for the new status field in a 2.10.x release?
    • Feels like a grey area - but probably OK.
    • k8s seems to be OK with full-on schema additions (to spec too) as long as they're optional/nullable. This is hopefully less impactful since users cannot control status.
    • The current result on a cluster without the CRD change is:
      • a new warning that originCluster does not exist on status,
      • status conditions are correctly set to represent unknown in-place viability.
  • Do we want to explore supporting an even more backwards-friendly method of detecting origin IDs via the filename?
    • For the best UX, we will account for existing backups with the UID in the filename. In other words, once the CRD is updated with the new status field, we'll add the UID we find in the filename (or do nothing if it doesn't match the expected filename format).
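A rough sketch of what the filename fallback could look like (the helper name and the filename shown are illustrative; the actual backup naming scheme may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// uidPattern matches a Kubernetes UID (an RFC 4122 UUID) embedded
// anywhere in a filename.
var uidPattern = regexp.MustCompile(`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`)

// originUIDFromFilename extracts a cluster UID from a backup filename such as
// "mybackup-8afff587-b8ef-4ac9-8f29-2f6a1f2eacbb-2024-10-31.tar.gz".
// It returns "" when the filename does not contain a UID, in which case
// the caller would leave the origin status unset.
func originUIDFromFilename(filename string) string {
	return uidPattern.FindString(filename)
}

func main() {
	fmt.Println(originUIDFromFilename("mybackup-8afff587-b8ef-4ac9-8f29-2f6a1f2eacbb-2024-10-31.tar.gz"))
	fmt.Println(originUIDFromFilename("no-uid-here.tar.gz") == "")
}
```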

@mallardduck mallardduck requested a review from a team as a code owner October 31, 2024 23:27
@mallardduck mallardduck force-pushed the bro-metadata-context branch 2 times, most recently from 4967e19 to f0b6b30 Compare November 1, 2024 01:26
@mallardduck
Copy link
Member Author

mallardduck commented Nov 1, 2024

@jbiers - I'm not moving this to "in review" yet, since this was a very rough first pass.

However, if you can take some time to get familiar with the changes and review the list of concerns I thought of (and consider/add your own), that would be a great help to keep the ball rolling on this. In other words, while I could keep working on this, I'd love your feedback before I spend more time on it.

@mallardduck (Member, Author) commented:

@alexandreLamarre - I've added you as a reviewer here as I'd like to get your take on this. When you have time, LMK if you'd like to sync up and chat in a huddle. Or feel free to review in your own time and give me feedback here.

@alexandreLamarre (Contributor) left a comment:

Mostly some small comments about making the code simpler, more readable, and testable.

Since this PR could be a stepping stone to fully allowing backups of backup CRDs themselves, storing the cluster origin metadata in the Status feels wrong.

But if we do intend to fully support taking backups of backup CRDs, then even storing the cluster origin metadata in the Spec section of the CRD feels wrong.

What happens when we have a list of remote backups say A, B, C, all taken from different origins?

If the current cluster matches origin C then our status will indicate the restore is viable even if they specify an older backup A or B. I know this hypothetical would be rare but I think it points to the fact that maybe we should think about this differently.

The current design labels recurring backups as coming from this cluster only, whereas one-off backups are tied to the specific clusters they were taken from.

What if we embedded cluster origin metadata not in the CRDs, but the backup data itself?

This way the flow looks like

  • Each remote backup contains a cluster origin
  • Each restore pointing to a specific backup can determine before actually restoring objects if it is viable
  • We can add a flag to the restore spec to allow unsafe backups, ones where the cluster origin doesn't match
  • By default, we disallow cases where we could have unsafe restores
  • If the restore is unsafe, the status is surfaced through the restore CR
  • Users can then modify the restore CR to restore the backup even if it is unsafe after the fact, or opt in before creating the restore CR

From the end user's perspective it is somewhat clunkier to create a restore and then be informed that it is potentially unsafe, but as an end user I value being shown correct information more.

Additionally, storing the cluster origin metadata in the backup data allows us to add tests for each corner case in the e2e restore tests, producing a potentially less volatile new feature.
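The proposed flow could be sketched roughly like this (all type and field names here are illustrative, not the actual Backup/Restore CRD schema):

```go
package main

import "fmt"

// BackupMetadata is an illustrative stand-in for origin metadata embedded
// in the backup archive itself, rather than in a CRD.
type BackupMetadata struct {
	ClusterOriginUID string
}

// RestoreSpec is an illustrative restore spec with an explicit opt-in
// flag for restoring backups whose origin doesn't match this cluster.
type RestoreSpec struct {
	BackupFilename string
	AllowUnsafe    bool
}

// restoreViable decides, before any objects are restored, whether the
// backup's embedded origin matches the current cluster, honoring the
// unsafe opt-in. The reason string would be surfaced via the restore CR's
// status so users can opt in after the fact.
func restoreViable(md BackupMetadata, spec RestoreSpec, currentClusterUID string) (bool, string) {
	if md.ClusterOriginUID == currentClusterUID {
		return true, "origin matches current cluster"
	}
	if spec.AllowUnsafe {
		return true, "origin mismatch, but unsafe restore was explicitly allowed"
	}
	return false, "origin mismatch; set allowUnsafe to restore anyway"
}

func main() {
	ok, reason := restoreViable(
		BackupMetadata{ClusterOriginUID: "uid-a"},
		RestoreSpec{AllowUnsafe: false},
		"uid-b",
	)
	fmt.Println(ok, reason)
}
```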

Comment on lines +9 to +16
backupName string
hasClusterOriginID bool
clusterOriginID string
hasCurrentOriginCondition bool
currentOriginCondition bool
canInPlaceRestore bool
hasInPlaceRestoreCondition bool
currentInPlaceRestoreCondition bool

@alexandreLamarre (Contributor) commented Jan 6, 2025:
Instead of having boolean fields injected during construction, we could encapsulate:

type backupClusterMdChecker struct {
	backupName    string
	clusterOrigin string
	b             *backupv1.Backup
	// Any controllers or client runtime needed for methods, if any
}

func (b backupClusterMdChecker) hasClusterOriginID() bool {
	// ...
}

func (b backupClusterMdChecker) hasCurrentOriginCondition() bool {
	// ...
}

This allows us to isolate some of the more complicated logic, and also allows for testing if we make these methods or the struct public, which we probably should, because we should be focused on stability.

If making them public, we could consider moving them to a sub-package, pkg/controllers/backup/origin or something, to prevent imports of heavy k8s dependencies in unit tests.

Also greatly simplifies the if statements below IMO.

dynamicClient: dynamicInterface,
defaultBackupMountPath: defaultLocalBackupLocation,
defaultS3BackupLocation: defaultS3,
canUseClusterOriginStatus: util.VerifyBackupCrdHasClusterStatus(clientSet.ApiextensionsV1()),

@alexandreLamarre (Contributor) commented:

I would avoid passing the handler a value computed by a function that makes API calls. Instead, I would include that logic in the setup itself, before the call to Register.

See comment below as well.

Comment on lines +118 to +121
clusterOriginChanged := h.prepareClusterOriginConditions(backup)
if clusterOriginChanged {
shouldUpdateStatus = true
}

@alexandreLamarre (Contributor) commented Jan 6, 2025:

nit: can simplify using

if originChanged := h.prepareClusterOriginConditions(backup); originChanged {
	shouldUpdateStatus = true
}

Comment on lines +110 to +113
if err != nil {
logrus.Infof("Error fetching CRD: %v", err)
return false
}

@alexandreLamarre (Contributor) commented:

In a more production-ready version of this, I'd expect errors like this to be surfaced in the setup step, and potentially to exit, since an error like that signifies either a more serious problem or a transient one (in the latter case, we could consider a backoff mechanism).

@@ -37,6 +39,7 @@ type BackupSpec struct {

type BackupStatus struct {
Conditions []genericcondition.GenericCondition `json:"conditions"`
OriginCluster string `json:"originCluster,omitempty"`

@alexandreLamarre (Contributor) commented:

I'm not convinced the origin cluster metadata should be stored in the Status field.

I can see from the original CRD definition that the semantic difference between spec (the desired state of the object) and the status (what is the current state of the object) is a little bit muddied in this operator.

IMO, the origin cluster (and additional metadata) could be set in the Spec and be fully managed by the operator

In any case, the backup CRD's status should be the one surfacing whether or not a backup is viable for sure 👍

Comment on lines +109 to +120
crd, err := getCRDDefinition(client, crdName)
if err != nil {
logrus.Infof("Error fetching CRD: %v", err)
return false
}

// Inspect the status schema, for example
_, found := crd.Spec.Versions[0].Schema.OpenAPIV3Schema.Properties["status"].Properties["originCluster"]
if found {
logrus.Debugf("Status schema contains `originCluster` on CRD `%s`.\n", crdName)
return true
}

@alexandreLamarre (Contributor) commented:

Going back to the above discussion, storing the cluster metadata in the backup CRD spec as a nullable object shouldn't be a breaking change, and we can instead check whether it is nil rather than doing dynamic CRD reflection.

IMO it is easier to make this testable if we avoid inspecting CRD definitions and instead rely on the presence/absence of a new field.
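For illustration, a nil check on a nullable spec field could look like this (field and type names are hypothetical, not the actual CRD schema):

```go
package main

import "fmt"

// ClusterOrigin is an illustrative nullable spec object; a flat string
// could not distinguish "unset" from "empty" as cleanly.
type ClusterOrigin struct {
	UID string `json:"uid"`
}

// BackupSpec carries the origin as a pointer so presence can be tested
// with a nil check instead of dynamic CRD reflection.
type BackupSpec struct {
	ClusterOrigin *ClusterOrigin `json:"clusterOrigin,omitempty"`
}

// originKnown reports whether the origin was ever recorded: older
// operator versions simply never set the field, so nil means "unknown".
func originKnown(spec BackupSpec) bool {
	return spec.ClusterOrigin != nil
}

func main() {
	fmt.Println(originKnown(BackupSpec{}))
	fmt.Println(originKnown(BackupSpec{ClusterOrigin: &ClusterOrigin{UID: "abc"}}))
}
```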

}

// prepareClusterOriginConditions helps set the cluster origin conditions and reports if anything changed in this part of status.
func (h *handler) prepareClusterOriginConditions(backup *v1.Backup) bool {

@alexandreLamarre (Contributor) commented:

I think a lot of the logic in this function could also be simplified and made more readable if we encapsulate most of it in struct methods.

kubeSystemNS string
// TODO: rename to kubeSystemNamespaceUID; nit to improve clarity, it's not the string representation nor the NS resource
kubeSystemNS string
canUseClusterOriginStatus bool

@alexandreLamarre (Contributor) commented:

Stylistically, we should avoid storing additional boolean fields in favor of a method.

I think this field might be useless?

If we opt into using the backup CRD spec for storing cluster metadata, and older backup-restore versions don't set this new nullable field, then we cannot determine where the backup is from.

currentOriginConditionString := condition.Cond(v1.BackupConditionClusterOrigin).GetStatus(backup)
if currentOriginConditionString != "False" {
condition.Cond(v1.BackupConditionClusterOrigin).SetStatusBool(backup, false)
condition.Cond(v1.BackupConditionClusterOrigin).Message(backup, "CRD not updated to include cluster UID yet.")

@alexandreLamarre (Contributor) commented:

When would this realistically happen?

@mallardduck (Member, Author) replied:

Yeah, at least I think so. Granted, it was during dev workflows, but I did run into it myself.

Maybe it's not as big a concern in prod, but I think it could still happen with a chart like BRO if the CRDs aren't also updated, i.e. when installed via the helm CLI directly. Or even in an edge case where a user manually targets a newer image than the chart. However, if we opt not to change the CRD to add this to status, this wouldn't be needed anyway.
